Approximate String Matching with Variable Length Don't Care
Abstract
Searching for DNA or amino acid sequences similar to a given pattern string is very important in molecular biology, and many programs and algorithms have been developed for it. Most of them are based on alignment of strings or approximate string matching. However, they do not seem to be adequate in some cases. For example, the DNA pattern TATA (known as the TATA box) is a common promoter that often appears after the pattern CAATCT (known as the CAAT box) within 30 to 50 spaces [3], [6]. To find strings containing such a pattern, string matching with variable length don't cares is required, where a variable length don't care character can match any substring whose length is in a specified range. Moreover, exact matching is not sufficient; approximate string matching is required, since the patterns can appear with some probability of error. Thus this paper studies the problem of approximate string matching with variable length don't cares.

Here, we briefly review the previous work. Exact string matching with don't cares was studied by Fischer and Paterson [4]. Approximate string matching has been studied extensively [5]. Myers and Miller studied approximate string matching of regular expressions [7], and Zhang, Shasha and Wang studied approximate tree matching with variable length don't cares [8]. Although these two studies are close to our problem, the ranges of lengths of don't cares cannot be specified. Exact string matching with variable length don't cares was studied by Manber and Baeza-Yates [6]. In their work, the range can be specified, but approximate matching was not considered. We also developed an algorithm for approximate string matching with don't cares [1]. However, it cannot handle variable length don't cares.
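To make the problem statement concrete, the following is a minimal Python sketch, not the algorithm of the paper. It assumes a simplified error model in which only mismatches (substitutions) inside the literal parts of the pattern are counted, and it represents the pattern as a list of literal strings separated by don't care gaps given as (minimum, maximum) length ranges. The names `hamming_ending_at` and `approx_match_vldc` are hypothetical, and the example text is made up for illustration.

```python
from math import inf


def hamming_ending_at(text, lit):
    """cost[e] = mismatches between lit and text[e-len(lit):e]; inf if lit does not fit."""
    m = len(lit)
    cost = [inf] * (len(text) + 1)
    for e in range(m, len(text) + 1):
        cost[e] = sum(a != b for a, b in zip(text[e - m:e], lit))
    return cost


def approx_match_vldc(text, literals, gaps, k):
    """Return (end, mismatches) pairs for occurrences with at most k mismatches.

    literals: pattern pieces, e.g. ["CAATCT", "TATA"]
    gaps:     one (lo, hi) length range between each pair of consecutive pieces
    Assumption: errors are substitutions only (no insertions or deletions).
    """
    assert len(gaps) == len(literals) - 1
    # best[e] = fewest mismatches for the pieces processed so far, ending at text[:e]
    best = hamming_ending_at(text, literals[0])
    for lit, (lo, hi) in zip(literals[1:], gaps):
        cost = hamming_ending_at(text, lit)
        new = [inf] * (len(text) + 1)
        for e in range(len(lit), len(text) + 1):
            start = e - len(lit)  # position where this piece begins
            # the previous piece must end between start - hi and start - lo
            prev_ends = range(max(0, start - hi), start - lo + 1)
            prev = min((best[p] for p in prev_ends), default=inf)
            new[e] = cost[e] + prev
        best = new
    return [(e, c) for e, c in enumerate(best) if c <= k]


# CAAT box, then 30 arbitrary characters, then TATA box (illustrative text).
text = "GGCAATCT" + "A" * 30 + "TATA" + "CC"
print(approx_match_vldc(text, ["CAATCT", "TATA"], [(30, 50)], k=0))  # → [(42, 0)]
```

This sketch runs in time proportional to the text length times the sum of the piece lengths and gap widths, which is enough to illustrate the definition. Handling insertions and deletions as well would require an edit-distance recurrence in place of the Hamming cost.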
Similar resources
A Parallel Algorithm for Fixed-Length Approximate String-Matching with k-mismatches
This paper deals with the approximate string-matching problem with Hamming distance. The approximate string-matching with k-mismatches problem is to find all locations at which a query of length m matches a factor of a text of length n with k or fewer mismatches. The approximate string-matching algorithms have both pleasing theoretical features, as well as direct applications, especially in comp...
Faster Subsequence and Don't-Care Pattern Matching on Compressed Texts
Subsequence pattern matching problems on compressed text were first considered by Cégielski et al. (Window Subsequence Problems for Compressed Texts, Proc. CSR 2006, LNCS 3967, pp. 127–136), where the principal problem is: given a string T represented as a straight line program (SLP) T of size n, a string P of size m, compute the number of minimal subsequence occurrences of P in T. We present ...
A Parallel Algorithm for the Fixed-length Approximate String Matching Problem for High Throughput Sequencing Technologies
The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Nowadays, with the advent of novel high throughput sequencing technologies, the approximate string matching algorithms are used to identify similarities, molecular functions and abnormalities in DNA sequences. We consider a generali...
VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams
Many applications need to solve the following problem of approximate string matching: from a collection of strings, how to find those similar to a given string, or the strings in another (possibly the same) collection of strings? Many algorithms are developed using fixed-length grams, which are substrings of a string used as signatures to identify similar strings. In this paper we develop a nov...
Approximate String Matching by Finite Automata
Abstract. Approximate string matching is a sequential problem and therefore it is possible to solve it using finite automata. A nondeterministic finite automaton is constructed for string matching with k mismatches. It is shown how "dynamic programming" and "shift-and" based algorithms simulate this nondeterministic finite automaton. The corresponding deterministic finite automaton have O...